A variable-length category-based n-gram language model
Authors
Abstract
A language model based on word-category n-grams and ambiguous category membership, with n increased selectively to trade compactness for performance, is presented. The use of categories leads intrinsically to a compact model with the ability to generalise to unseen word sequences, and diminishes the sparseness of the training data, thereby making larger n feasible. The language model implicitly involves a statistical tagging operation, which may be used explicitly to assign categories to untagged text. Experiments on the LOB corpus show the optimal model-building strategy to yield improved results with respect to conventional n-gram methods, and when used as a tagger, the model performs well in relation to a standard benchmark.
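As a rough illustration of the idea (a minimal sketch, not the paper's actual implementation), the probability of a word under a category-based bigram with ambiguous category membership can be computed by marginalising over the categories of both the predicted word and the history word. All probability tables below are invented toy values:

```python
# Sketch: P(w_i | w_{i-1}) = sum over category pairs of
#   P(w_i | c_i) * P(c_i | c_{i-1}) * P(c_{i-1} | w_{i-1}).
# "bark" is deliberately ambiguous (NOUN or VERB).

word_given_cat = {            # P(w | c): word emission per category
    ("the", "DET"): 0.6,
    ("dog", "NOUN"): 0.2,
    ("bark", "NOUN"): 0.05,
    ("bark", "VERB"): 0.1,
}
cat_bigram = {                # P(c_i | c_{i-1}): category transitions
    ("DET", "NOUN"): 0.8,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
}
cat_given_word = {            # P(c | w): ambiguous membership of the history word
    ("the", "DET"): 1.0,
    ("dog", "NOUN"): 1.0,
    ("bark", "NOUN"): 0.4,
    ("bark", "VERB"): 0.6,
}

def p_next(word, prev_word):
    """Probability of `word` following `prev_word`, marginalising over
    the (possibly ambiguous) categories of both words."""
    total = 0.0
    for (w, c), p_wc in word_given_cat.items():
        if w != word:
            continue
        for (pw, pc), p_cw in cat_given_word.items():
            if pw != prev_word:
                continue
            total += p_wc * cat_bigram.get((pc, c), 0.0) * p_cw
    return total
```

Because the model only stores word-given-category and category-transition tables, it remains compact and can score word pairs never seen together in training, which is the generalisation property the abstract refers to.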
Similar resources
Category-based Statistical Language Models Synopsis
Language models are computational techniques and structures that describe word sequences produced by human subjects, and the work presented here considers primarily their application to automatic speech-recognition systems. Due to the very complex nature of natural languages as well as the need for robust recognition, statistically-based language models, which assign probabilities to word seque...
Variable-length category-based n-grams for language modelling
This report concerns the theoretical development and subsequent evaluation of n-gram language models based on word categories. In particular, part-of-speech word classifications have been employed as a means of incorporating significant amounts of a-priori grammatical information into the model. The utilisation of categories diminishes the problem of data sparseness which plagues conventional w...
A Succinct N-gram Language Model
Efficient processing of tera-scale text data is an important research topic. This paper proposes lossless compression of N-gram language models based on LOUDS, a succinct data structure. LOUDS succinctly represents a trie with M nodes as a 2M + 1 bit string. We compress it further for the N-gram language model structure. We also use ‘variable length coding’ and ‘block-wise compression’ to comp...
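The LOUDS encoding mentioned in the snippet can be sketched concretely (a toy construction, not the paper's compressed variant): traverse the trie breadth-first, write each node's degree in unary (d ones followed by a zero), and prepend a "10" for the super-root, which yields exactly 2M + 1 bits for M nodes:

```python
from collections import deque

def louds_bits(tree, root):
    """Encode a tree (dict: node -> list of children) as a LOUDS bit
    string: breadth-first order, each node's degree in unary (d ones
    then a zero), with a '10' prefix for the super-root."""
    bits = "10"                     # super-root has one child: the root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = tree.get(node, [])
        bits += "1" * len(children) + "0"
        queue.extend(children)
    return bits

# toy trie with 5 nodes ("", "a", "b", "ab", "bc") -> 2*5 + 1 = 11 bits
trie = {"": ["a", "b"], "a": ["ab"], "b": ["bc"]}
bits = louds_bits(trie, "")
```

Each node contributes one '1' (when listed as a child) and one '0' (terminating its own child list), plus the super-root's terminator, which is where the 2M + 1 bound comes from.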
Comparison of part-of-speech and automatically derived category-based language models for speech recognition
To appear in: Proc. ICASSP-98 © IEEE 1998. ABSTRACT This paper compares various category-based language models when used in conjunction with a word-based trigram by means of linear interpolation. Categories corresponding to parts-of-speech as well as automatically clustered groupings are considered. The category-based model employs variable-length n-grams and permits each word to belong to mult...
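The linear interpolation used in that comparison reduces to a weighted sum of the two models' probabilities; a minimal sketch, where the weight `lam` is a hypothetical value that would in practice be tuned on held-out data (e.g. by EM):

```python
def interpolated_prob(p_word, p_cat, lam=0.7):
    """Linearly interpolate a word-based trigram probability `p_word`
    with a category-based model probability `p_cat`.
    `lam` is an illustrative weight, not a value from the paper."""
    return lam * p_word + (1.0 - lam) * p_cat
```

Because both inputs are valid probabilities and the weights sum to one, the interpolated value is itself a valid probability, so the combined model needs no renormalisation.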
Bayesian Variable Order n-gram Language Model based on Pitman-Yor Processes
This paper proposes a variable order n-gram language model by extending a recently proposed model based on the hierarchical Pitman-Yor processes. Introducing a stochastic process on an infinite depth suffix tree, we can infer the hidden n-gram context from which each word originated. Experiments on standard large corpora showed validity and efficiency of the proposed model. Our architecture is ...
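The hierarchical Pitman-Yor predictive rule underlying such models can be sketched as a recursion that backs off to ever-shorter contexts, bottoming out in a uniform distribution; the parameter values and table layout below are illustrative assumptions, not the paper's inference scheme:

```python
def py_prob(w, context, counts, tables, theta=1.0, d=0.5, vocab_size=1000):
    """Hierarchical Pitman-Yor predictive probability (a sketch):
    P(w|u) = (c_uw - d*t_uw)/(theta + c_u)
           + (theta + d*t_u)/(theta + c_u) * P(w | shorter context).
    `counts[u][w]` are customer counts and `tables[u][w]` table counts
    for context tuple u (oldest word first); the base case is uniform."""
    if not context:
        return 1.0 / vocab_size
    u = context
    c_u = sum(counts.get(u, {}).values())
    t_u = sum(tables.get(u, {}).values())
    c_uw = counts.get(u, {}).get(w, 0)
    t_uw = tables.get(u, {}).get(w, 0)
    shorter = py_prob(w, context[1:], counts, tables, theta, d, vocab_size)
    if c_u == 0:
        return shorter          # unseen context: pure backoff
    return (max(c_uw - d * t_uw, 0.0) / (theta + c_u)
            + (theta + d * t_u) / (theta + c_u) * shorter)
```

The discount d and concentration theta control how much probability mass is reserved for the shorter-context estimate, which is what lets the effective n-gram order vary with the data.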